Text Content Filtering Based on Chinese Character Reconstruction from Radicals

Authors

  • Wenlei He
  • Gongshen Liu
  • Jun Luo
  • Jiuchuan Lin
Abstract

Content filtering through keyword matching is widely adopted in network censoring and has proven successful. However, a technique for bypassing this kind of censorship by decomposing Chinese characters has recently appeared. Chinese characters are combinations of radicals, and splitting characters into radicals poses a major obstacle to keyword filtering. To tackle this challenge, we propose the first filtering technology based on the combination of Chinese character radicals. We use a modified Rabin-Karp algorithm to reconstruct characters from radicals according to a Chinese character structure library. We then use another modified Rabin-Karp algorithm to filter keywords in massive text content. Experiments show that our approach can identify most keywords written as combinations of radicals and yields a visible improvement in filtering results compared with traditional keyword filtering.
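The two-stage idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' modified Rabin-Karp algorithm, whose details are not given here; it is a plain multi-pattern Rabin-Karp search with a rolling hash, used once to merge radical sequences back into characters and once to look for keywords. The `STRUCTURE_LIBRARY` entries, the function names `find_all` and `reconstruct`, and the constants `BASE` and `MOD` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical structure library: radical sequence -> reconstructed character.
# These three entries are illustrative only, not taken from the paper.
STRUCTURE_LIBRARY = {"女子": "好", "日月": "明", "木木": "林"}

BASE, MOD = 257, 1_000_000_007  # assumed rolling-hash parameters


def _hash(s):
    """Polynomial hash of a string over its code points."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def find_all(text, patterns):
    """Return sorted (start, pattern) pairs for every occurrence in text."""
    by_len = defaultdict(dict)  # pattern length -> {hash: set of patterns}
    for p in patterns:
        if p:
            by_len[len(p)].setdefault(_hash(p), set()).add(p)
    hits = []
    for m, table in by_len.items():
        if m > len(text):
            continue
        high = pow(BASE, m - 1, MOD)  # weight of the window's leading char
        h = _hash(text[:m])
        for i in range(len(text) - m + 1):
            if h in table:
                window = text[i:i + m]
                if window in table[h]:  # verify to rule out hash collisions
                    hits.append((i, window))
            if i + m < len(text):
                # Slide the window: drop text[i], append text[i + m].
                h = ((h - ord(text[i]) * high) * BASE + ord(text[i + m])) % MOD
    return sorted(hits)


def reconstruct(text, library):
    """Replace radical sequences with characters, left to right, longest first."""
    hits = sorted(find_all(text, library.keys()), key=lambda x: (x[0], -len(x[1])))
    out, pos = [], 0
    for start, pat in hits:
        if start < pos:  # overlaps a replacement already made
            continue
        out.append(text[pos:start])
        out.append(library[pat])
        pos = start + len(pat)
    out.append(text[pos:])
    return "".join(out)


keywords = ["你好"]
raw = "你女子"  # the keyword with its second character split into radicals
print(find_all(raw, keywords))                                  # no match on raw text
print(find_all(reconstruct(raw, STRUCTURE_LIBRARY), keywords))  # match after merging
```

The sketch merges every radical sequence it finds, so a legitimate two-character word such as 女子 would also be collapsed; it makes no attempt to resolve that ambiguity.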


Similar resources

An Analysis of Radicals-based Features in Subjectivity Classification on Simplified Chinese Sentences

Chinese radicals are linguistic elements smaller than Chinese characters. Normally, a radical is a semantic category, and almost all characters contain radicals or are radicals themselves. In subjectivity classification on sentences, we can use radicals to represent characters, which reduces the scale of the word space while keeping the subjectivity information. In this paper, we manually labeled a cha...


An Assessment of character-based Chinese News Filtering Using Latent Semantic Indexing

We assess the Latent Semantic Indexing (LSI) approach to Chinese information filtering. In particular, the approach is for Chinese news filtering agents that use a character-based and hierarchical filtering scheme. The traditional vector space model is employed as an information filtering model, and each document is converted into a vector of weights of terms. Instead of using words as terms in...


Recent Results of Online Japanese Handwriting Recognition and Its Applications

This paper discusses online handwriting recognition of Japanese characters, a mixture of ideographic characters (Kanji) of Chinese origin, and the phonetic characters made from them. Most Kanji character patterns are composed of multiple subpatterns, called radicals, which are shared among many (sometimes hundreds of) Kanji character patterns. This is common in Oriental languages of Chinese ori...


Distributional Similarity for Chinese: Exploiting Characters and Radicals

Distributional Similarity has attracted considerable attention in the field of natural language processing as an automatic means of countering the ubiquitous problem of sparse data. As a logographic language, Chinese words consist of characters and each of them is composed of one or more radicals. The meanings of characters are usually highly related to the words which contain them. Likewise, r...


RAN: Radical analysis networks for zero-shot learning of Chinese characters

Chinese characters have a huge set of character categories, more than 20,000, and the number is still increasing as novel characters continue to be created. However, this enormous set of characters can be decomposed into a few fundamental structural radicals, only about 500. This paper introduces the Radical Analysis Networks (RAN) that recognize Chinese characters by identifying radicals ...



Publication date: 2010